Search for: All records

Creators/Authors contains: "Rangamani, Akshay"

  1. While previous optimization results have suggested that deep neural networks tend to favour low-rank weight matrices, the implications of this inductive bias on generalization bounds remain underexplored. In this paper, we apply a chain rule for Gaussian complexity (Maurer, 2016a) to analyze how low-rank layers in deep networks can prevent the accumulation of rank and dimensionality factors that typically multiply across layers. This approach yields generalization bounds for rank and spectral norm constrained networks. We compare our results to prior generalization bounds for deep networks, highlighting how deep networks with low-rank layers can achieve better generalization than those with full-rank layers. Additionally, we discuss how this framework provides new perspectives on the generalization capabilities of deep networks exhibiting neural collapse. Keywords: Gaussian complexity, Generalization bounds, Neural collapse, Low rank layers 
    Free, publicly-accessible full text available December 28, 2025
  2. It has long been clear that, for neural networks, SGD with a small minibatch size yields much larger asymptotic fluctuations in the weight-matrix updates than GD. It has also often been reported that SGD in deep ReLU networks empirically shows a low-rank bias in the weight matrices. A recent theoretical analysis derived a bound on the rank and linked it to the size of the SGD fluctuations [25]. In this paper, we provide an empirical and theoretical analysis of the convergence of SGD vs GD, first for deep ReLU networks and then for the case of linear regression, where sharper estimates can be obtained and which is of independent interest. In the linear case, we prove that the component $$W^\perp$$ of the matrix $$W$$ corresponding to the null space of the data matrix $$X$$ converges to zero for both SGD and GD, provided the regularization term is non-zero. Because of the larger number of updates required to go through all the training data, the convergence rate per epoch of these components is much faster for SGD than for GD. In practice, SGD has a much stronger bias than GD towards solutions for weight matrices $$W$$ with high fluctuations (even when the choice of mini-batches is deterministic) and low rank, provided the initialization is from a random matrix. Thus SGD with non-zero regularization, unlike GD, shows the coupled phenomenon of asymptotic noise and a low-rank bias. 
  3. In this paper, we conduct an empirical study of the feature learning process in deep classifiers. Recent research has identified a training phenomenon called Neural Collapse (NC), in which the top-layer feature embeddings of samples from the same class tend to concentrate around their means, and the top layer’s weights align with those features. Our study investigates whether these properties extend to intermediate layers. We empirically study the evolution of the covariance and mean of representations across different layers and show that, as we move deeper into a trained neural network, the within-class covariance decreases relative to the between-class covariance. Additionally, we find that in the top layers, where the between-class covariance is dominant, the subspace spanned by the class means aligns with the subspace spanned by the most significant singular vector components of the weight matrix in the corresponding layer. Finally, we discuss the relationship between NC and Associative Memories (Willshaw et al., 1969). 
  4. In this paper, we study kernel ridge-less regression, including the case of interpolating solutions. We prove that maximizing the leave-one-out ([Formula: see text]) stability minimizes the expected error. Further, we also prove that the minimum norm solution — to which gradient algorithms are known to converge — is the most stable solution. More precisely, we show that the minimum norm interpolating solution minimizes a bound on [Formula: see text] stability, which in turn is controlled by the smallest singular value, hence the condition number, of the empirical kernel matrix. These quantities can be characterized in the asymptotic regime where both the dimension ([Formula: see text]) and cardinality ([Formula: see text]) of the data go to infinity (with [Formula: see text] as [Formula: see text]). Our results suggest that the property of [Formula: see text] stability of the learning algorithm with respect to perturbations of the training set may provide a more general framework than the classical theory of Empirical Risk Minimization (ERM). While ERM was developed to deal with the classical regime in which the architecture of the learning network is fixed and [Formula: see text], the modern regime focuses on interpolating regressors and overparameterized models, when both [Formula: see text] and [Formula: see text] go to infinity. Since the stability framework is known to be equivalent to the classical theory in the classical regime, our results here suggest that it may be interesting to extend it beyond kernel regression to other overparameterized algorithms such as deep networks. 
  5. We overview several properties, old and new, of training overparameterized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous rectified linear unit networks. We study the convergence to a solution with the absolute minimum ρ, which is the product of the Frobenius norms of each layer weight matrix, when normalization by Lagrange multipliers is used together with weight decay under different forms of gradient descent. A main property of the minimizers that bounds their expected error for a specific network architecture is ρ. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next, we prove that quasi-interpolating solutions obtained by stochastic gradient descent in the presence of weight decay have a bias toward low-rank weight matrices, which should improve generalization. The same analysis predicts the existence of an inherent stochastic gradient descent noise for deep networks. In both cases, we verify our predictions experimentally. We then predict neural collapse and its properties without any specific assumption, unlike other published proofs. Our analysis supports the idea that the advantage of deep networks relative to other classifiers is greater for problems that are appropriate for sparse deep architectures such as convolutional neural networks. The reason is that compositionally sparse target functions can be approximated well by “sparse” deep networks without incurring the curse of dimensionality. 
  6. We overview several properties -- old and new -- of training overparametrized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous ReLU networks. We study the convergence to a solution with the absolute minimum $$\rho$$, which is the product of the Frobenius norms of each layer weight matrix, when normalization by Lagrange multipliers (LM) is used together with Weight Decay (WD) under different forms of gradient descent. A main property of the minimizers that bounds their expected error for a specific network architecture is $$\rho$$. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next, we prove that quasi-interpolating solutions obtained by Stochastic Gradient Descent (SGD) in the presence of WD have a bias towards low-rank weight matrices that, as we also explain, should improve generalization. The same analysis predicts the existence of an inherent SGD noise for deep networks. In both cases, we verify our predictions experimentally. We then predict Neural Collapse and its properties without any specific assumption -- unlike other published proofs. Our analysis supports the idea that the advantage of deep networks relative to other classifiers is greater for the problems that are appropriate for sparse deep architectures such as CNNs. The deep reason is that compositionally sparse target functions can be approximated well by "sparse" deep networks without incurring the curse of dimensionality. 
  7. Compressive video measurements can save bandwidth and data storage. However, conventional approaches to target detection require the compressive measurements to be reconstructed before any detectors are applied. This is not only time-consuming but may also lose information in the reconstruction process. In this paper, we summarize the application to human targets of a recent approach to vehicle detection and classification that works directly in the compressive measurement domain. The raw videos were collected using a pixel-wise code exposure (PCE) camera, which condensed multiple frames into one frame. A combination of two deep learning-based algorithms (you only look once (YOLO) and residual network (ResNet)) was used for detection and confirmation. Optical and mid-wave infrared (MWIR) videos from a well-known database (SENSIAC) were used in our experiments. Extensive experiments demonstrated that the proposed framework was feasible for target detection up to 1500 m, but target confirmation needs more research. 
  8. The pixel-wise code exposure (PCE) camera is a compressive sensing camera that has several advantages, such as low power consumption and high compression ratio. Moreover, one notable advantage is the capability to control individual pixel exposure time. Conventional approaches to using PCE cameras involve a time-consuming and lossy process that reconstructs the original frames and then uses those frames for target tracking and classification; applied directly to the compressive measurements, these conventional approaches fail. In this paper, we present a deep learning approach that directly performs target tracking and classification in the compressive measurement domain without any frame reconstruction. Our approach has two parts: tracking and classification. Tracking is done via detection using You Only Look Once (YOLO), and classification is achieved using a residual network (ResNet). Extensive simulations using short-wave infrared (SWIR) videos demonstrated the efficacy of our proposed approach. 
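
The null-space statement in item 2 (linear regression case) can be illustrated with a small numerical sketch. This is not the authors' code. It assumes a squared loss with an L2 penalty of strength `lam`, a fixed step size `lr`, and deterministic mini-batches; the names `null_space_projector` and `null_component_norms` and all parameter values are hypothetical. Under these assumptions the data-dependent part of every gradient lies in the row space of X, so each update shrinks the null-space component of the weights by the factor (1 - lr * lam), and a mini-batch method, which performs more updates per epoch, shrinks it faster per epoch than full-batch GD.

```python
import numpy as np


def null_space_projector(X):
    """Orthogonal projector onto null(X) for an (n, d) data matrix X."""
    d = X.shape[1]
    return np.eye(d) - np.linalg.pinv(X) @ X


def null_component_norms(X, y, w0, lr, lam, batch_size, epochs):
    """Run (mini-batch) gradient descent on ridge-regularized least squares
    and return the norm of the null-space component of w after each epoch.

    batch_size == n gives full-batch GD; smaller values give a deterministic
    mini-batch scheme (batches are taken in a fixed order, no shuffling).
    """
    n = X.shape[0]
    P = null_space_projector(X)
    w = w0.copy()
    norms = [np.linalg.norm(P @ w)]
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            Xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            # Data term lies in the row space of X; only lam * w acts on null(X).
            grad = Xb.T @ (Xb @ w - yb) / len(yb) + lam * w
            w = w - lr * grad
        norms.append(np.linalg.norm(P @ w))
    return np.array(norms)


# Illustrative (hypothetical) setup: fewer samples than dimensions,
# so null(X) is non-trivial.
rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w0 = rng.standard_normal(d)
lr, lam, epochs = 0.01, 0.5, 50

gd_norms = null_component_norms(X, y, w0, lr, lam, batch_size=n, epochs=epochs)
sgd_norms = null_component_norms(X, y, w0, lr, lam, batch_size=5, epochs=epochs)

print("null-space norm after GD:        ", gd_norms[-1])
print("null-space norm after mini-batch:", sgd_norms[-1])
```

With a batch size of 5 and 20 samples there are four updates per epoch, so in this sketch the per-epoch contraction is (1 - lr * lam)^4 for the mini-batch run against (1 - lr * lam) for GD. With `lam = 0` the null-space component would not change at all, which matches the abstract's condition that the regularization term be non-zero.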